8.1. Recreate the simulated data from Exercise 7.2:
Overall
V1 8.83890885
V2 6.49023056
V3 0.67583163
V4 7.58822553
V5 2.27426009
V6 0.17436781
V7 0.15136583
V8 -0.03078937
V9 -0.02989832
V10 -0.08529218
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The model does use predictors V6-V10, but they are not as important in the model as the other predictors. V1 is the most important.
[1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change?
When we add one predictor that is highly correlated with V1, the importance score for V1 decreases. V4 is now the most important predictor.
Overall
V1 6.29780744
V2 6.08038134
V3 0.58410718
V4 6.93924427
V5 2.03104094
V6 0.07947642
V7 -0.02566414
V8 -0.11007435
V9 -0.08839463
V10 -0.00715093
duplicate1 3.56411581
What happens when you add another predictor that is also highly correlated with V1?
When we add yet another predictor that is also highly correlated with V1, the importance of V1 in our model decreases once more.
Overall
V1 6.19983255
V2 6.48305559
V3 0.42757857
V4 7.51171970
V5 2.27902033
V6 0.12585434
V7 0.05716457
V8 -0.08621813
V9 -0.05339077
V10 -0.01113780
duplicate1 3.24628944
duplicate2 1.09060225
The importances from the conditional inference forest show largely the same pattern as the traditional random forest. V6-V10 are still the least important predictors, but V3 becomes less important relative to its ranking in the traditional random forest model.
V1 V2 V3 V4 V5 V6
2.487590376 4.580552877 0.027953446 5.615551510 1.057098660 0.015134623
V7 V8 V9 V10 duplicate1 duplicate2
0.005205066 0.005728994 -0.008507938 -0.010683929 0.690821538 0.067750307
Using an XGBoost tree model, V4 is the most important predictor, followed by V1 and V2. Predictors V6-V10 are less important again.
xgbTree variable importance
Overall
V4 100.0000
V2 76.3687
V1 60.5621
V5 32.5734
V3 32.3700
duplicate1 23.3339
V6 2.2923
duplicate2 1.1489
V8 1.0969
V7 0.3611
V10 0.1584
V9 0.0000
8.2. Use a simulation to show tree bias with different granularities.
Tree selection bias occurs when predictors with many distinct values (finer granularity) are favored over predictors with few distinct values. In our simulation, we create the response y so that it depends only on variable a, which takes just two distinct values, so we would expect a to be the most important predictor in the tree model. However, variable b, which y does not depend on but which has many more distinct values, ends up with a higher importance score than a.
Overall
a 0.1102470
b 0.8259556
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
As we increase the bagging fraction, each tree is fit to a larger share of the data, and as we simultaneously increase the learning rate, the model becomes greedier, with each iteration adding a larger correction that relies on the same few strong predictors. Combined, these two settings lead the model to concentrate its importance on fewer predictors, which explains the difference between the left-hand and right-hand plots.
The model on the left would likely be more predictive of other samples: with a small bagging fraction and learning rate it learns more gradually and spreads information across more predictors, whereas setting both parameters to 0.9 makes the model greedier and more prone to overfitting the training data.
Increasing interaction depth allows each tree to use more predictors, which spreads importance across more of them and flattens the slope of the importance plot: the top predictors lose some relative importance while lower-ranked predictors gain some.
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
First, we load the data, impute missing values, and split it into training and test sets.
Basic Regression Tree:
CART
140 samples
57 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.0999723 1.539236 0.3454827 1.218798
0.1033160 1.536178 0.3455412 1.213621
0.3706544 1.704810 0.2650904 1.358981
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.103316.
RMSE Rsquared MAE
1.1900480 0.5084895 0.9149989
Random Forest:
Random Forest
140 samples
57 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 1.337582 0.5824334 1.0839623
29 1.219222 0.6116325 0.9655119
57 1.228168 0.5954942 0.9714635
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 29.
RMSE Rsquared MAE
1.1761812 0.5228770 0.8348245
XGBoost:
eXtreme Gradient Boosting
140 samples
57 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
Resampling results across tuning parameters:
eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
0.3 1 0.6 0.50 50 1.281073 0.5335262
0.3 1 0.6 0.50 100 1.284182 0.5396167
0.3 1 0.6 0.50 150 1.279859 0.5465025
0.3 1 0.6 0.75 50 1.237670 0.5649550
0.3 1 0.6 0.75 100 1.219161 0.5823074
0.3 1 0.6 0.75 150 1.208760 0.5904512
0.3 1 0.6 1.00 50 1.232206 0.5626837
0.3 1 0.6 1.00 100 1.224734 0.5713825
0.3 1 0.6 1.00 150 1.224059 0.5745115
0.3 1 0.8 0.50 50 1.278181 0.5409179
0.3 1 0.8 0.50 100 1.272586 0.5516215
0.3 1 0.8 0.50 150 1.261833 0.5614731
0.3 1 0.8 0.75 50 1.248205 0.5572935
0.3 1 0.8 0.75 100 1.222497 0.5781756
0.3 1 0.8 0.75 150 1.220984 0.5812936
0.3 1 0.8 1.00 50 1.243247 0.5584073
0.3 1 0.8 1.00 100 1.227920 0.5719053
0.3 1 0.8 1.00 150 1.228331 0.5743774
0.3 2 0.6 0.50 50 1.263841 0.5533002
0.3 2 0.6 0.50 100 1.255004 0.5606604
0.3 2 0.6 0.50 150 1.254375 0.5617151
0.3 2 0.6 0.75 50 1.214977 0.5819239
0.3 2 0.6 0.75 100 1.202509 0.5917397
0.3 2 0.6 0.75 150 1.200370 0.5933968
0.3 2 0.6 1.00 50 1.214779 0.5802159
0.3 2 0.6 1.00 100 1.205293 0.5881199
0.3 2 0.6 1.00 150 1.204536 0.5890601
0.3 2 0.8 0.50 50 1.295514 0.5329512
0.3 2 0.8 0.50 100 1.278378 0.5459998
0.3 2 0.8 0.50 150 1.275684 0.5480316
0.3 2 0.8 0.75 50 1.232420 0.5678913
0.3 2 0.8 0.75 100 1.222114 0.5768790
0.3 2 0.8 0.75 150 1.220297 0.5784880
0.3 2 0.8 1.00 50 1.218493 0.5761531
0.3 2 0.8 1.00 100 1.207810 0.5842626
0.3 2 0.8 1.00 150 1.205934 0.5858674
0.3 3 0.6 0.50 50 1.266340 0.5498553
0.3 3 0.6 0.50 100 1.263218 0.5522246
0.3 3 0.6 0.50 150 1.263000 0.5523979
0.3 3 0.6 0.75 50 1.232254 0.5661501
0.3 3 0.6 0.75 100 1.230093 0.5682305
0.3 3 0.6 0.75 150 1.229907 0.5683666
0.3 3 0.6 1.00 50 1.202574 0.5838424
0.3 3 0.6 1.00 100 1.200089 0.5855398
0.3 3 0.6 1.00 150 1.199934 0.5856375
0.3 3 0.8 0.50 50 1.293319 0.5334740
0.3 3 0.8 0.50 100 1.286658 0.5386660
0.3 3 0.8 0.50 150 1.286170 0.5390214
0.3 3 0.8 0.75 50 1.218838 0.5790970
0.3 3 0.8 0.75 100 1.214359 0.5824039
0.3 3 0.8 0.75 150 1.214310 0.5824651
0.3 3 0.8 1.00 50 1.238530 0.5667261
0.3 3 0.8 1.00 100 1.237704 0.5675399
0.3 3 0.8 1.00 150 1.237648 0.5675898
0.4 1 0.6 0.50 50 1.365700 0.4932415
0.4 1 0.6 0.50 100 1.351888 0.5074751
0.4 1 0.6 0.50 150 1.345937 0.5141629
0.4 1 0.6 0.75 50 1.273148 0.5412808
0.4 1 0.6 0.75 100 1.258650 0.5572984
0.4 1 0.6 0.75 150 1.256923 0.5598387
0.4 1 0.6 1.00 50 1.258376 0.5504259
0.4 1 0.6 1.00 100 1.248129 0.5620757
0.4 1 0.6 1.00 150 1.244783 0.5673097
0.4 1 0.8 0.50 50 1.346017 0.5086315
0.4 1 0.8 0.50 100 1.337818 0.5219350
0.4 1 0.8 0.50 150 1.341290 0.5242514
0.4 1 0.8 0.75 50 1.289376 0.5356260
0.4 1 0.8 0.75 100 1.278958 0.5494314
0.4 1 0.8 0.75 150 1.271441 0.5572848
0.4 1 0.8 1.00 50 1.242731 0.5608155
0.4 1 0.8 1.00 100 1.241910 0.5670103
0.4 1 0.8 1.00 150 1.243206 0.5695683
0.4 2 0.6 0.50 50 1.320338 0.5216594
0.4 2 0.6 0.50 100 1.315813 0.5254276
0.4 2 0.6 0.50 150 1.315218 0.5259712
0.4 2 0.6 0.75 50 1.289582 0.5429952
0.4 2 0.6 0.75 100 1.287257 0.5470073
0.4 2 0.6 0.75 150 1.287047 0.5475724
0.4 2 0.6 1.00 50 1.233368 0.5715938
0.4 2 0.6 1.00 100 1.227122 0.5766626
0.4 2 0.6 1.00 150 1.225574 0.5776521
0.4 2 0.8 0.50 50 1.316470 0.5256837
0.4 2 0.8 0.50 100 1.309982 0.5304919
0.4 2 0.8 0.50 150 1.309166 0.5311503
0.4 2 0.8 0.75 50 1.254402 0.5649274
0.4 2 0.8 0.75 100 1.244670 0.5714362
0.4 2 0.8 0.75 150 1.243009 0.5726216
0.4 2 0.8 1.00 50 1.254566 0.5621069
0.4 2 0.8 1.00 100 1.251743 0.5656487
0.4 2 0.8 1.00 150 1.251135 0.5661447
0.4 3 0.6 0.50 50 1.330358 0.5118063
0.4 3 0.6 0.50 100 1.327620 0.5134157
0.4 3 0.6 0.50 150 1.327319 0.5136295
0.4 3 0.6 0.75 50 1.297246 0.5378903
0.4 3 0.6 0.75 100 1.295638 0.5390569
0.4 3 0.6 0.75 150 1.295633 0.5390587
0.4 3 0.6 1.00 50 1.282014 0.5381810
0.4 3 0.6 1.00 100 1.281122 0.5389368
0.4 3 0.6 1.00 150 1.281114 0.5389417
0.4 3 0.8 0.50 50 1.303177 0.5338694
0.4 3 0.8 0.50 100 1.300635 0.5354077
0.4 3 0.8 0.50 150 1.300469 0.5354850
0.4 3 0.8 0.75 50 1.263667 0.5582641
0.4 3 0.8 0.75 100 1.262950 0.5589727
0.4 3 0.8 0.75 150 1.262950 0.5589797
0.4 3 0.8 1.00 50 1.228379 0.5705836
0.4 3 0.8 1.00 100 1.227866 0.5709596
0.4 3 0.8 1.00 150 1.227847 0.5709701
MAE
0.9968671
1.0047690
0.9992347
0.9736137
0.9560512
0.9476665
0.9687589
0.9595069
0.9588250
1.0025958
0.9884565
0.9806110
0.9721616
0.9524760
0.9526839
0.9824306
0.9613286
0.9611580
0.9896431
0.9783558
0.9780415
0.9478509
0.9375196
0.9357242
0.9462404
0.9393341
0.9391572
1.0006740
0.9879809
0.9842625
0.9456932
0.9326957
0.9320749
0.9447736
0.9340364
0.9324208
0.9956355
0.9936119
0.9933951
0.9605906
0.9578711
0.9576292
0.9348046
0.9323701
0.9322187
1.0012237
0.9943317
0.9942661
0.9500726
0.9470249
0.9468911
0.9664168
0.9649426
0.9648693
1.0734340
1.0566926
1.0503838
1.0101949
0.9930406
0.9907811
0.9894004
0.9756055
0.9718447
1.0560798
1.0465640
1.0484434
1.0152241
1.0003764
0.9938049
0.9788495
0.9689052
0.9664942
1.0358751
1.0303242
1.0291131
0.9958917
0.9936324
0.9935035
0.9597746
0.9550805
0.9538157
1.0333404
1.0278694
1.0277482
0.9673061
0.9585069
0.9573300
0.9693101
0.9666240
0.9661635
1.0423192
1.0393788
1.0392871
1.0046807
1.0035617
1.0035599
0.9999953
0.9995995
0.9996010
1.0416721
1.0389396
1.0387409
0.9645330
0.9634168
0.9634135
0.9491362
0.9486436
0.9486314
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 150, max_depth = 3, eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample = 1.
RMSE Rsquared MAE
1.2570559 0.4949626 0.8991779
We considered a basic regression tree (CART), a random forest, and an XGBoost model. The random forest gave the best test-set performance, with an RMSE of 1.18 and an R^2 of 0.52, and its resampling results were comparable to the best XGBoost configuration.
Manufacturing process variables dominate the list. The optimal linear model (elastic net) included 3 biological predictors among its top 10, but they were less important than the manufacturing processes. The optimal nonlinear model (SVM) incorporated more predictors and had one additional biological predictor among its top predictors. Most of the top predictors in the random forest model are likewise manufacturing processes, but one biological predictor (BiologicalMaterial12) ranks third, which distinguishes it from the other models, where the highest-ranked predictors were all manufacturing processes.
rf variable importance
only 20 most important variables shown (out of 57)
Overall
ManufacturingProcess32 100.000
ManufacturingProcess13 39.807
BiologicalMaterial12 34.167
ManufacturingProcess17 25.162
ManufacturingProcess31 19.495
BiologicalMaterial03 16.679
ManufacturingProcess09 16.276
BiologicalMaterial06 12.736
ManufacturingProcess06 11.094
ManufacturingProcess11 10.634
BiologicalMaterial11 10.103
ManufacturingProcess36 8.545
BiologicalMaterial05 7.346
ManufacturingProcess24 7.084
BiologicalMaterial02 7.045
BiologicalMaterial04 5.823
ManufacturingProcess18 5.607
ManufacturingProcess28 5.193
BiologicalMaterial08 4.852
ManufacturingProcess30 4.735
It does not make sense to plot a single tree from a random forest or XGBoost model, so we plot the basic regression tree, even though it underperformed those models. The tree has only two splits, both on manufacturing process variables, which is consistent with the pattern that the biological material predictors are less important.
---
title: "DATA624 Beverage Manufacturing Modeling Final Project"
author: "Omar Pineda, Calvin Wong, Murali Kunissery"
date: "4/13/2020"
output:
flexdashboard::flex_dashboard:
orientation: rows
source_code: embed
---
Sidebar {.sidebar}
-------------------------------------
### Assignment
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
Row {.tabset .tabset-fade}
-------------------------------------
### Non-technical Report
### Technical Report
```{r}
library(caTools)
library(DMwR)
library(mlbench)
library(randomForest)
library(caret)
library(rpart)
```
8.1. Recreate the simulated data from Exercise 7.2:
```{r}
library(mlbench)
set.seed(200)
# Friedman 1 benchmark: y depends on predictors V1-V5; V6-V10 are uninformative
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
```
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
```{r}
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
```
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The model does use predictors V6-V10, but they are not as important in the model as the other predictors. V1 is the most important.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
```{r}
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
```
Fit another random forest model to these data. Did the importance score for V1 change?
When we add one predictor that is highly correlated with V1, the importance score for V1 decreases. V4 is now the most important predictor.
```{r}
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
```
What happens when you add another predictor that is also highly correlated with V1?
When we add yet another predictor that is also highly correlated with V1, the importance of V1 in our model decreases once more.
```{r}
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .2
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
```
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
The importances from the conditional inference forest show largely the same pattern as the traditional random forest. V6-V10 are still the least important predictors, but V3 becomes less important relative to its ranking in the traditional random forest model.
```{r}
library(party)
# conditional inference forest on the same simulated data
cmodel1 <- cforest(y ~ ., data = simulated)
# conditional = TRUE uses the modified importance measure of Strobl et al. (2007)
varimp(cmodel1, conditional = TRUE)
```
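For comparison, the traditional (unconditional) permutation importance can also be computed from the same fit. This optional check is not part of the original write-up; it simply makes the contrast with the conditional scores explicit.
```{r}
# optional check: unconditional permutation importance from the same cforest fit
varimp(cmodel1, conditional = FALSE)
```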
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Using an XGBoost tree model, V4 is the most important predictor, followed by V1 and V2. Predictors V6-V10 are less important again.
```{r}
# boosted trees via caret, using its default xgbTree tuning grid
xgbmodel1 <- train(y ~ ., data = simulated, method = "xgbTree")
rfImpx1 <- varImp(xgbmodel1)
rfImpx1
```
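Part (d) also mentions Cubist. A minimal sketch of the same importance check with a Cubist model is shown below; it assumes the Cubist package is installed, and the object name cubistmodel1 is only illustrative.
```{r}
# sketch: repeat the importance check with a Cubist model (Cubist package assumed)
cubistmodel1 <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistmodel1)
```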
8.2. Use a simulation to show tree bias with different granularities.
Tree selection bias occurs when predictors with many distinct values (finer granularity) are favored over predictors with few distinct values. In our simulation, we create the response y so that it depends only on variable a, which takes just two distinct values, so we would expect a to be the most important predictor in the tree model. However, variable b, which y does not depend on but which has many more distinct values, ends up with a higher importance score than a.
```{r}
library(rpart)
set.seed(200)
# predictor a takes only two distinct values and is the one y actually depends on
a <- rep(1:2, each=100)
y <- a + rnorm(200, mean=0, sd=6)
set.seed(400)
# predictor b is continuous (many distinct values) and unrelated to y
b <- rnorm(200, mean=0, sd=3)
dat <- data.frame(y, a, b)
biasMod <- rpart(y ~ ., data = dat)
simImp <- varImp(biasMod)
simImp
```
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for
the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
As we increase the bagging fraction, each tree is fit to a larger share of the data, and as we simultaneously increase the learning rate, the model becomes greedier, with each iteration adding a larger correction that relies on the same few strong predictors. Combined, these two settings lead the model to concentrate its importance on fewer predictors, which explains the difference between the left-hand and right-hand plots.
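As a rough illustration (not part of the original answer, which refers to the solubility data), the effect can be reproduced on the simulated data from Exercise 8.1 by fitting gbm with the two extreme settings and comparing how concentrated the relative influence is; the gbm package is assumed to be installed.
```{r}
# illustrative sketch: boosting with both parameters at 0.1 vs. both at 0.9
library(gbm)
set.seed(624)
spreadFit <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1)
greedyFit <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9)
# relative influence sums to 100 within each model; the 0.9/0.9 fit is expected
# to assign a larger share of it to the top one or two predictors
head(summary(spreadFit, plotit = FALSE))
head(summary(greedyFit, plotit = FALSE))
```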
(b) Which model do you think would be more predictive of other samples?
The model on the left would likely be more predictive of other samples: with a small bagging fraction and learning rate it learns more gradually and spreads information across more predictors, whereas setting both parameters to 0.9 makes the model greedier and more prone to overfitting the training data.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig.8.24?
Increasing interaction depth allows each tree to use more predictors, which spreads importance across more of them and flattens the slope of the importance plot: the top predictors lose some relative importance while lower-ranked predictors gain some.
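Building on the sketch above (again only as an illustration on the simulated data rather than the solubility data), refitting the greedy model with a larger interaction depth shows the influence spreading across more predictors.
```{r}
# illustrative sketch: same settings as greedyFit above, but deeper trees
set.seed(624)
deepFit <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 7)
# compared with greedyFit (interaction.depth = 1), the influence here is expected
# to be spread across more predictors, flattening the importance slope
head(summary(deepFit, plotit = FALSE), 10)
```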
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
First, we load the data, impute missing values, and split it into training and test sets.
```{r}
# load data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
# imputation: fill missing predictor values with k-nearest neighbors
# (the response column is named "Yield", not "yield", so it is retained here;
#  Yield has no missing values, so imputing the full data frame is harmless)
cmp <- knnImputation(ChemicalManufacturingProcess[, !names(ChemicalManufacturingProcess) %in% "yield"])
# data splitting: 80/20 split via caTools::sample.split on BiologicalMaterial01
set.seed(101)
sample = sample.split(cmp$BiologicalMaterial01, SplitRatio = .8)
cmp_train = subset(cmp, sample == TRUE)
cmp_test = subset(cmp, sample == FALSE)
# separate predictors and response
cmp_train_X = subset(cmp_train, select = -Yield)
cmp_train_y = cmp_train[,'Yield']
cmp_test_X = subset(cmp_test, select = -Yield)
cmp_test_y = cmp_test[,'Yield']
```
(a) Which tree-based regression model gives the optimal resampling and test set performance?
Basic Regression Tree:
```{r}
cmpst <- train(x = cmp_train_X, y = cmp_train_y, method = "rpart")
cmpst
```
```{r}
cmpstPred <- predict(cmpst, newdata = cmp_test_X)
postResample(pred = cmpstPred, obs = cmp_test_y)
```
Random Forest:
```{r}
cmprf <- train(x = cmp_train_X, y = cmp_train_y, method = "rf")
cmprf
```
```{r}
cmprfPred <- predict(cmprf, newdata = cmp_test_X)
postResample(pred = cmprfPred, obs = cmp_test_y)
```
XGBoost:
```{r}
cmpxgb <- train(x = cmp_train_X, y = cmp_train_y, method = "xgbTree")
cmpxgb
```
```{r}
cmpxgbPred <- predict(cmpxgb, newdata = cmp_test_X)
postResample(pred = cmpxgbPred, obs = cmp_test_y)
```
We considered a basic regression tree (CART), a random forest, and an XGBoost model. The random forest gave the best test-set performance, with an RMSE of 1.18 and an R^2 of 0.52, and its resampling results were comparable to the best XGBoost configuration.
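As an optional summary (not in the original write-up), the test-set results computed above can be collected into one table for easier comparison, reusing the prediction objects from the preceding chunks.
```{r}
# side-by-side test-set performance of the three tree-based models
rbind(CART = postResample(pred = cmpstPred, obs = cmp_test_y),
      RandomForest = postResample(pred = cmprfPred, obs = cmp_test_y),
      XGBoost = postResample(pred = cmpxgbPred, obs = cmp_test_y))
```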
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Manufacturing process variables dominate the list. The optimal linear model (elastic net) included 3 biological predictors among its top 10, but they were less important than the manufacturing processes. The optimal nonlinear model (SVM) incorporated more predictors and had one additional biological predictor among its top predictors. Most of the top predictors in the random forest model are likewise manufacturing processes, but one biological predictor (BiologicalMaterial12) ranks third, which distinguishes it from the other models, where the highest-ranked predictors were all manufacturing processes.
```{r}
varImp(cmprf)
```
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
It does not make sense to plot a single tree from a random forest or XGBoost model, so we plot the basic regression tree, even though it underperformed those models. The tree has only two splits, both on manufacturing process variables, which is consistent with the pattern that the biological material predictors are less important.
```{r}
library(partykit)
plot(as.party(cmpst$finalModel))
```